<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="pandoc" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <title>README</title>
  <style>
    html {
      line-height: 1.5;
      font-family: Georgia, serif;
      font-size: 20px;
      color: #1a1a1a;
      background-color: #fdfdfd;
    }
    body {
      margin: 0 auto;
      max-width: 36em;
      padding-left: 50px;
      padding-right: 50px;
      padding-top: 50px;
      padding-bottom: 50px;
      hyphens: auto;
      overflow-wrap: break-word;
      text-rendering: optimizeLegibility;
      font-kerning: normal;
    }
    @media (max-width: 600px) {
      body {
        font-size: 0.9em;
        padding: 1em;
      }
      h1 {
        font-size: 1.8em;
      }
    }
    @media print {
      body {
        background-color: transparent;
        color: black;
        font-size: 12pt;
      }
      p, h2, h3 {
        orphans: 3;
        widows: 3;
      }
      h2, h3, h4 {
        page-break-after: avoid;
      }
    }
    p {
      margin: 1em 0;
    }
    a {
      color: #1a1a1a;
    }
    a:visited {
      color: #1a1a1a;
    }
    img {
      max-width: 100%;
    }
    h1, h2, h3, h4, h5, h6 {
      margin-top: 1.4em;
    }
    h5, h6 {
      font-size: 1em;
      font-style: italic;
    }
    h6 {
      font-weight: normal;
    }
    ol, ul {
      padding-left: 1.7em;
      margin-top: 1em;
    }
    li > ol, li > ul {
      margin-top: 0;
    }
    blockquote {
      margin: 1em 0 1em 1.7em;
      padding-left: 1em;
      border-left: 2px solid #e6e6e6;
      color: #606060;
    }
    code {
      font-family: Menlo, Monaco, 'Lucida Console', Consolas, monospace;
      font-size: 85%;
      margin: 0;
    }
    pre {
      margin: 1em 0;
      overflow: auto;
    }
    pre code {
      padding: 0;
      overflow: visible;
      overflow-wrap: normal;
    }
    .sourceCode {
     background-color: transparent;
     overflow: visible;
    }
    hr {
      background-color: #1a1a1a;
      border: none;
      height: 1px;
      margin: 1em 0;
    }
    table {
      margin: 1em 0;
      border-collapse: collapse;
      width: 100%;
      overflow-x: auto;
      display: block;
      font-variant-numeric: lining-nums tabular-nums;
    }
    table caption {
      margin-bottom: 0.75em;
    }
    tbody {
      margin-top: 0.5em;
      border-top: 1px solid #1a1a1a;
      border-bottom: 1px solid #1a1a1a;
    }
    th {
      border-top: 1px solid #1a1a1a;
      padding: 0.25em 0.5em 0.25em 0.5em;
    }
    td {
      padding: 0.125em 0.5em 0.25em 0.5em;
    }
    header {
      margin-bottom: 4em;
      text-align: center;
    }
    #TOC li {
      list-style: none;
    }
    #TOC ul {
      padding-left: 1.3em;
    }
    #TOC > ul {
      padding-left: 0;
    }
    #TOC a:not(:hover) {
      text-decoration: none;
    }
    code{white-space: pre-wrap;}
    span.smallcaps{font-variant: small-caps;}
    span.underline{text-decoration: underline;}
    div.column{display: inline-block; vertical-align: top; width: 50%;}
    div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
    ul.task-list{list-style: none;}
    .display.math{display: block; text-align: center; margin: 0.5rem auto;}
  </style>
  <!--[if lt IE 9]>
    <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
  <![endif]-->
</head>
<body>
<h1
id="machine-learning-can-predict-shooting-victimization-well-enough-to-help-prevent-it">Machine
Learning Can Predict Shooting Victimization Well Enough To Help Prevent
It</h1>
<h2 id="overview">Overview</h2>
<p>This package contains the replication code for producing each
statistic, table, and figure included in the paper. The code that
transforms the raw input datasets into a clean data file for modeling
is written in a combination of R and Python. The modeling code and the
code for producing tables and figures are written in Python.</p>
<h2 id="data-availability">Data availability</h2>
<p>Please note that the paper uses confidential individual-level
administrative data from the Chicago Police Department (CPD). We are
legally prohibited from releasing or sharing these datasets due to the
data use agreements in place. To support transparency and replication,
however, we are happy to connect interested researchers with our
contacts at CPD to assist with requests for the data.</p>
<h2 id="dataset-list">Dataset list</h2>
<p>The datasets we received for this analysis are listed below. For more
information, see Appendix A.1 of the paper.</p>
<h3 id="arrests">Arrests</h3>
<blockquote>
<p>Source: Chicago Police Department</p>
</blockquote>
<p>Offender and time/place information for arrests from 1999 to
present.</p>
<table>
<colgroup>
<col style="width: 30%" />
<col style="width: 69%" />
</colgroup>
<thead>
<tr class="header">
<th>Variable Description</th>
<th>Actual Field(s)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Arrest ID</td>
<td>ARREST_ID</td>
</tr>
<tr class="even">
<td>Crime ID</td>
<td>RD_NO</td>
</tr>
<tr class="odd">
<td>Arrest date</td>
<td>ARREST_DATE</td>
</tr>
<tr class="even">
<td>Arrest address</td>
<td>STREET_NO, STREET_DIRECTION_CD, STREET_NME, APT_NO, CITY, STATE_CD,
ZIP_CD</td>
</tr>
<tr class="odd">
<td>Police beat of the arrest</td>
<td>ARR_BEAT</td>
</tr>
<tr class="even">
<td>Arrest class (misdemeanor or felony)</td>
<td>CHARGE_TYPE_CD</td>
</tr>
<tr class="odd">
<td>Offender name</td>
<td>FIRST_NME, MIDDLE_NME, LAST_NME</td>
</tr>
<tr class="even">
<td>Offender DOB and age</td>
<td>BIRTH_DATE, AGE</td>
</tr>
<tr class="odd">
<td>Offender race and sex</td>
<td>SEX_CODE_CD, RACE_CODE_CD</td>
</tr>
<tr class="even">
<td>Offender home address</td>
<td>O_STREET_NO, O_STREET_DIRECTION_CD, O_STREET_NME, O_APT_NO, O_CITY,
O_STATE_CD, O_ZIP_CD</td>
</tr>
<tr class="odd">
<td>Offender ID (based on fingerprint)</td>
<td>IR_NO</td>
</tr>
<tr class="even">
<td>Offender gang affiliation</td>
<td>GANG_NAME, FACTION_NAME</td>
</tr>
</tbody>
</table>
<h3 id="charges">Charges</h3>
<blockquote>
<p>Source: Chicago Police Department</p>
</blockquote>
<p>Charge information for arrests from 1999 to present.</p>
<table>
<colgroup>
<col style="width: 31%" />
<col style="width: 68%" />
</colgroup>
<thead>
<tr class="header">
<th>Variable Description</th>
<th>Actual Field(s)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Charge ID</td>
<td>CHARGE_CODE_ID</td>
</tr>
<tr class="even">
<td>Arrest ID</td>
<td>ARREST_ID</td>
</tr>
<tr class="odd">
<td>Charge class (misdemeanor or felony)</td>
<td>CHARGE_TYPE_CD</td>
</tr>
<tr class="even">
<td>FBI code (i.e. charge type)</td>
<td>FBI_CODE</td>
</tr>
<tr class="odd">
<td>Charge statute description</td>
<td>DESCR</td>
</tr>
</tbody>
</table>
<h3 id="victimizations">Victimizations</h3>
<blockquote>
<p>Source: Chicago Police Department</p>
</blockquote>
<p>Victim and time/place/crime information for victimizations from 1999
to present.</p>
<table>
<colgroup>
<col style="width: 29%" />
<col style="width: 70%" />
</colgroup>
<thead>
<tr class="header">
<th>Variable Description</th>
<th>Actual Field(s)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Victimization ID</td>
<td>NO_ID</td>
</tr>
<tr class="even">
<td>Crime ID</td>
<td>RD</td>
</tr>
<tr class="odd">
<td>Victimization date</td>
<td>DATEOCC</td>
</tr>
<tr class="even">
<td>Police beat of the victimization</td>
<td>BEAT</td>
</tr>
<tr class="odd">
<td>FBI code (i.e. crime type)</td>
<td>VFBI_CD</td>
</tr>
<tr class="even">
<td>Crime statute description</td>
<td>VDESCRIPTION</td>
</tr>
<tr class="odd">
<td>Victim name</td>
<td>VFIRST, VMI, VLAST</td>
</tr>
<tr class="even">
<td>Victim DOB and age</td>
<td>VDOB, VAGE</td>
</tr>
<tr class="odd">
<td>Victim race and sex</td>
<td>VSEX, VRACE</td>
</tr>
<tr class="even">
<td>Victim home address</td>
<td>VSTNUM, VSTDIR, VSTREET, VAPT, VCITY, VSTATE, ZIP_CD</td>
</tr>
</tbody>
</table>
<h3 id="crimes">Crimes</h3>
<blockquote>
<p>Source: Chicago Police Department</p>
</blockquote>
<p>Additional crime type information for victimizations from 1999 to
present.</p>
<table>
<colgroup>
<col style="width: 29%" />
<col style="width: 70%" />
</colgroup>
<thead>
<tr class="header">
<th>Variable Description</th>
<th>Actual Field(s)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Crime ID</td>
<td>RD</td>
</tr>
<tr class="even">
<td>Domestic indicator</td>
<td>DOMESTIC_I</td>
</tr>
</tbody>
</table>
<h3 id="shooting-victimizations">Shooting victimizations</h3>
<blockquote>
<p>Source: Chicago Police Department</p>
</blockquote>
<p>Victim and time/place information for reported shooting crimes from
2011 to present.</p>
<table>
<colgroup>
<col style="width: 29%" />
<col style="width: 70%" />
</colgroup>
<thead>
<tr class="header">
<th>Variable Description</th>
<th>Actual Field(s)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Victimization ID</td>
<td>ID</td>
</tr>
<tr class="even">
<td>Crime ID</td>
<td>RD</td>
</tr>
<tr class="odd">
<td>Victimization date</td>
<td>DATEOCC</td>
</tr>
<tr class="even">
<td>Victim name</td>
<td>FIRST_NAME, LAST_NAME</td>
</tr>
<tr class="odd">
<td>Victim DOB and age</td>
<td>VDOB, VAGE</td>
</tr>
<tr class="even">
<td>Victim race and sex</td>
<td>VSEX, VRACE</td>
</tr>
<tr class="odd">
<td>FBI code (i.e. crime type)</td>
<td>FBI_CD</td>
</tr>
<tr class="even">
<td>Crime statute description</td>
<td>DESCRIPTION</td>
</tr>
<tr class="odd">
<td>Domestic indicator</td>
<td>DOMESTIC_I</td>
</tr>
<tr class="even">
<td>Fatal indicator</td>
<td>INBOX</td>
</tr>
</tbody>
</table>
<h3 id="homicide-victimizations">Homicide victimizations</h3>
<blockquote>
<p>Source: Chicago Police Department</p>
</blockquote>
<p>Victim and time/place information for reported homicides from 1982 to
present.</p>
<table>
<colgroup>
<col style="width: 30%" />
<col style="width: 69%" />
</colgroup>
<thead>
<tr class="header">
<th>Variable Description</th>
<th>Actual Field(s)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Homicide ID</td>
<td>ID</td>
</tr>
<tr class="even">
<td>Crime date</td>
<td>INJURY_DATE</td>
</tr>
<tr class="odd">
<td>Death date</td>
<td>DEATH_DATE</td>
</tr>
<tr class="even">
<td>Description of the injury (e.g. shot)</td>
<td>INJURY_DESCR</td>
</tr>
<tr class="odd">
<td>Victim name</td>
<td>VICTIM_FIRST_NME, VICTIM_MIDDLE_I, VICTIM_LAST_NME</td>
</tr>
<tr class="even">
<td>Victim age</td>
<td>VICTIM_AGE</td>
</tr>
<tr class="odd">
<td>Victim race and sex</td>
<td>VICTIM_SEX, VICTIM_RACE</td>
</tr>
<tr class="even">
<td>Victim ID (based on fingerprint)</td>
<td>VICTIM_IR_NO</td>
</tr>
</tbody>
</table>
<h3 id="suspected-homicide-offenders">Suspected homicide offenders</h3>
<blockquote>
<p>Source: Chicago Police Department</p>
</blockquote>
<p>Suspect information for reported homicides from 1982 to present.</p>
<table>
<colgroup>
<col style="width: 29%" />
<col style="width: 70%" />
</colgroup>
<thead>
<tr class="header">
<th>Variable Description</th>
<th>Actual Field(s)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Homicide ID</td>
<td>HOM_HOMICIDE_ID</td>
</tr>
<tr class="even">
<td>Suspect name</td>
<td>FIRST_NME, MIDDLE_INITIAL, LAST_NME</td>
</tr>
<tr class="odd">
<td>Suspect age</td>
<td>AGE</td>
</tr>
<tr class="even">
<td>Suspect race and sex</td>
<td>SEX, RACE</td>
</tr>
<tr class="odd">
<td>Suspect ID (based on fingerprint)</td>
<td>IR_NO</td>
</tr>
</tbody>
</table>
<h2 id="description-of-programscode">Description of programs/code</h2>
<p>The code is partitioned into three parts: data processing, modeling,
and table/figure creation.</p>
<h3 id="data-processing">Data processing</h3>
<blockquote>
<p>Folder: data_prep</p>
</blockquote>
<ul>
<li><code>compute_victim_indices</code>: Creates crime type indicators
using FBI code and crime description (e.g. domestic, violent,
property).</li>
<li><code>compute_charge_indices</code>: Creates arresting charge type
indicators using FBI code and charge description (e.g. domestic,
violent, drug).</li>
<li><code>compute_periods</code>: Determines the start and end date for
each cohort’s inclusion period and outcome period.</li>
<li><code>compute_sample_clusters</code>: Determines the individuals
(clusters) in each cohort.</li>
<li><code>windowed_features</code>: Computes features counting each
individual’s prior victimizations and arrests of a given type
(e.g. number of drug arrests in the past 2 years).</li>
<li><code>time_since_features</code>: Computes features indicating the
number of days since each individual’s last victimization or arrest of a
given type (e.g. days since last drug arrest).</li>
<li><code>modal_and_most_recent_features</code>: Computes features for
demographic information (e.g. modal race, most recent arrest beat).</li>
<li><code>make_networks</code>: Builds a network of individuals based on
co-arrests and co-victimizations.</li>
<li><code>compute_network_metrics</code>: Computes features related to
network connectivity (e.g. number of co-arrestees/co-victims).</li>
<li><code>compute_windowed_neighbor_histories</code>: Computes features
for the number of prior arrests and victimizations of a given type among
each individual’s co-arrestees and co-victims.</li>
<li><code>compute_outcomes</code>: Computes outcome measures indicating
victimization of a given type within a given time period (e.g. gun crime
victim within 18 months).</li>
</ul>
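<p>The three feature families described above (windowed counts, time
since last event, and network connectivity) can be illustrated with a
minimal sketch. It is not the package's implementation: the record
layout, the function names, the 365-day year, and the rule that two
people are co-arrestees when they share a crime ID are all assumptions
made for illustration.</p>

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical event records: (person_id, event_date, event_type, crime_id).
ARRESTS = [
    ("A", date(2015, 3, 1), "drug", "RD1"),
    ("A", date(2016, 6, 15), "drug", "RD2"),
    ("B", date(2016, 6, 15), "drug", "RD2"),
    ("A", date(2012, 1, 10), "violent", "RD3"),
]


def count_in_window(events, person, event_type, as_of, years):
    """Count a person's events of one type in the `years` before `as_of`."""
    # Assumption: a "year" is 365 days; the real code may define it differently.
    start = as_of - timedelta(days=365 * years)
    return sum(
        1
        for pid, day, kind, _ in events
        if pid == person and kind == event_type and as_of > day >= start
    )


def days_since_last(events, person, event_type, as_of):
    """Days from the most recent matching event to `as_of`, or None if none."""
    prior = [
        day
        for pid, day, kind, _ in events
        if pid == person and kind == event_type and as_of > day
    ]
    return (as_of - max(prior)).days if prior else None


def co_arrestees(events, as_of):
    """Map each person to the people who share a crime ID before `as_of`."""
    by_crime = defaultdict(set)
    for pid, day, _, crime_id in events:
        if as_of > day:
            by_crime[crime_id].add(pid)
    neighbors = defaultdict(set)
    for people in by_crime.values():
        for pid in people:
            neighbors[pid] |= people - {pid}
    return neighbors


as_of = date(2017, 1, 1)
print(count_in_window(ARRESTS, "A", "drug", as_of, years=2))
print(days_since_last(ARRESTS, "A", "drug", as_of))
print(sorted(co_arrestees(ARRESTS, as_of)["A"]))
```

<p>Every function takes an <code>as_of</code> date and ignores events on
or after it, so features only use information that precedes the outcome
period.</p>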
<h3 id="modeling">Modeling</h3>
<blockquote>
<p>Folder: prediction</p>
</blockquote>
<ul>
<li><code>generate_feature_sets</code>: Combines the various types of
features permitted by a given modeling specification into a single
file.</li>
<li><code>simple_feature_sets</code>: Identifies the n most important
features for a given feature set specification (used for the simple
model specifications).</li>
<li><code>build_models</code>: Fits a gradient-boosted tree model or a
linear model for a given feature set and outcome.</li>
</ul>
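<p>As a rough illustration of the selection performed for the simple
model specifications, the sketch below keeps the n features with the
largest importance scores. The feature names and scores are invented,
and how the package actually measures importance is not specified here,
so treat this only as one plausible reading.</p>

```python
def top_n_features(importances, n):
    """Return the names of the n features with the largest importance."""
    # Ties are broken alphabetically so the selection is deterministic.
    ranked = sorted(importances, key=lambda name: (-importances[name], name))
    return ranked[:n]


# Hypothetical importance scores, e.g. taken from a fitted model.
scores = {
    "n_drug_arrests_2y": 0.12,
    "days_since_last_arrest": 0.30,
    "age": 0.25,
    "n_co_arrestees": 0.05,
}
print(top_n_features(scores, 2))
```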
<h3 id="table-and-figure-creation">Table and figure creation</h3>
<blockquote>
<p>Folder: tables_and_figures</p>
</blockquote>
<ul>
<li><code>Create Tables and Figures</code>: Loads outputs from previous
tasks in the codebase, computes the statistics shared in the paper’s
text, and generates all tables and figures.</li>
</ul>
<p>This code is licensed under the GPLv3. See LICENSE for details.</p>
<h2 id="list-of-tables-and-figures">List of tables and figures</h2>
<p>The provided code reproduces all tables and quantitative figures in
the paper, as detailed below.</p>
<table>
<thead>
<tr class="header">
<th>Figure/Table #</th>
<th>Output file</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Table 1</td>
<td>overall_performance_ml_vs_ols.tex</td>
</tr>
<tr class="even">
<td>Table A.1</td>
<td>N/A (manually entered)</td>
</tr>
<tr class="odd">
<td>Table A.2</td>
<td>N/A (manually entered)</td>
</tr>
<tr class="even">
<td>Table B.1</td>
<td>overall_performance_outcome_duration.tex</td>
</tr>
<tr class="odd">
<td>Table B.2</td>
<td>overall_performance_modeling_scheme.tex</td>
</tr>
<tr class="even">
<td>Table B.3</td>
<td>overall_performance_outcomes.tex</td>
</tr>
<tr class="odd">
<td>Table B.4</td>
<td>full_shooting_victim_demographics.tex</td>
</tr>
<tr class="even">
<td>Table B.5</td>
<td>sample_shot_alt.tex</td>
</tr>
<tr class="odd">
<td>Table B.6</td>
<td>top_4244_shot.tex</td>
</tr>
<tr class="even">
<td>Table B.7</td>
<td>overall_performance_feature_sets.tex</td>
</tr>
<tr class="odd">
<td>Table B.8</td>
<td>features_oaat_all.tex</td>
</tr>
<tr class="even">
<td>Table B.9</td>
<td>features_oaat_no_networks.tex</td>
</tr>
<tr class="odd">
<td>Table B.10</td>
<td>features_oaat_no_arrests.tex</td>
</tr>
<tr class="even">
<td>Table B.11</td>
<td>features_oaat_no_networks_no_arrests.tex</td>
</tr>
<tr class="odd">
<td>Table B.12</td>
<td>overall_performance_top_50_leave_out.tex</td>
</tr>
<tr class="even">
<td>Figure 1</td>
<td>calibration_by_race_2x2.png</td>
</tr>
<tr class="odd">
<td>Figure 2</td>
<td>precision.png, recall.png</td>
</tr>
<tr class="even">
<td>Figure 3</td>
<td>stacked_demo_bar.png</td>
</tr>
<tr class="odd">
<td>Figure 4</td>
<td>feature_sets_precision_ci.png</td>
</tr>
<tr class="even">
<td>Figure A.1</td>
<td>N/A (non-quantitative)</td>
</tr>
<tr class="odd">
<td>Figure B.1</td>
<td>shooting_victim_demographics.png</td>
</tr>
<tr class="even">
<td>Figure B.2</td>
<td>feature_sets_precision_ci_long.png</td>
</tr>
<tr class="odd">
<td>Figure B.3</td>
<td>feature_sets_precision_full_sample.png</td>
</tr>
<tr class="even">
<td>Figure B.4</td>
<td>no_race_calibration_by_race.png</td>
</tr>
<tr class="odd">
<td>Figure B.5</td>
<td>n_features_precision_panels_ci.png</td>
</tr>
</tbody>
</table>
<h2 id="computational-requirements">Computational requirements</h2>
<h3 id="software-requirements">Software requirements</h3>
<ul>
<li>Python (3.6.15)</li>
<li>R (3.6.3)</li>
</ul>
<p>The codebase relies on GNU Makefiles to automate running each script
in sequence. Using this functionality requires access to a Bash
terminal, such as the one available on Linux machines.</p>
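<p>A minimal sketch of driving the three stages in order from Python is
shown below. The folder names come from this README. The assumption that
each folder has its own Makefile with a default target is ours, so check
the repository layout before relying on it. <code>make -C</code> runs
<code>make</code> from the named directory.</p>

```python
import subprocess

# Folder names come from the README; a Makefile with a default target
# in each folder is an assumption of this sketch.
STAGES = ["data_prep", "prediction", "tables_and_figures"]


def make_commands(stages=STAGES):
    """Build one `make -C folder` command per stage, in pipeline order."""
    return [["make", "-C", stage] for stage in stages]


def run_pipeline(dry_run=True):
    """Run each stage in order, stopping at the first failure."""
    for command in make_commands():
        if dry_run:
            # Only show what would be executed.
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if a stage fails.
            subprocess.run(command, check=True)


run_pipeline(dry_run=True)
```

<p>With <code>dry_run=True</code> the commands are only printed, which is
a cheap way to confirm the stage order before committing to the long
modeling run.</p>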
<h3 id="memory-and-runtime-requirements">Memory and runtime
requirements</h3>
<p>The code was run on a large shared machine running Ubuntu 22.04 with
30 Intel(R) Xeon(R) CPUs and 512 gigabytes of RAM. Using a small portion
of the available resources, the total runtime of the code is about 2
weeks. The vast majority of this runtime is attributable to the
<code>build_models</code> step, which is run once for each of the many
model specifications reported in the paper.</p>
<style>
body { min-width: 80% !important; }
</style>
</body>
</html>
